At the Wikimedia Foundation we use A/B testing to optimize the number of donations per impression of our fundraising banners.

Although classical Null Hypothesis tests are probably still the most widespread technique for online A/B tests, I would argue that they are not the most appropriate for the web optimization use case. Null Hypothesis testing was designed for situations in which:

  1. the goal is to control the probability of declaring a difference in population means when there is none

Most academic scientists care about this because it corresponds to false discoveries, which are embarrassing at best. But in the banner optimization setting, we actually don't care about declaring the banners different when they are not. When the banners are equivalent in terms of click-through rate (CTR), there is no bad decision. It may send the designers down the wrong path, because they think a change they made to the control banner is a big win, but the choice has no impact on the expected revenue. What we really want is to control the probability of declaring the worse banner the winner when the banners are in fact different.

  2. the sample size is predetermined or fixed in advance

When running web experiments, we can often choose how long to run the experiment. Null Hypothesis testing does not take advantage of this fact. Instead, the best practice is to calculate the sample size in advance of running the test. To determine the sample size, you need an estimate of the CTR of your control banner and a decision on the minimum effect size you want to detect (MDE). You also need to choose the percent of the time a difference will be detected, assuming one does not exist (significance, or alpha), and the percent of the time the minimum effect size will be detected, assuming it exists (power). For more information, check out the Wikipedia page on sample size determination. For an online calculator, see Evan Miller's blog.

Ceteris paribus, the sample size increases as you decrease the MDE. Now imagine a scenario in which you decided you want to be able to detect a 5% difference in click-through rates. If the new banner is actually 50% better than the control, you have set yourself up for losing a lot of revenue on the portion of traffic exposed to the control. Instead of committing to a sample size in advance, we would really like our testing procedure to have an adaptive sample size: the larger the true difference in CTRs, the shorter it should run. This should be possible, since larger differences are easier to detect with less data.
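
To make the relationship concrete, here is a rough, self-contained sketch of the standard normal-approximation sample size formula. It is not the samples_per_branch_calculator helper used in the cells below, and the 0.2 control CTR is just an illustrative value:

import numpy as np
from scipy.stats import norm

def sample_size_per_branch(p_control, mde, alpha=0.05, power=0.95):
    # normal-approximation sample size for a two-sided, two-proportion test;
    # mde is the relative lift we want to be able to detect
    p_treatment = p_control * (1 + mde)
    z_alpha = norm.ppf(1 - alpha / 2.0)
    z_beta = norm.ppf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return int(np.ceil((z_alpha + z_beta) ** 2 * variance / (p_control - p_treatment) ** 2))

print("Per-branch sample size, 5%% relative MDE: %d" % sample_size_per_branch(0.2, 0.05))   # on the order of 42,000
print("Per-branch sample size, 50%% relative MDE: %d" % sample_size_per_branch(0.2, 0.50))  # only a few hundred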

It may be tempting (and many people have been tempted) to just run multiple null hypothesis tests as the data come in and decide to stop the test early or let it keep running. Some analysts even look at the p-value as a function of the sample size and stop the test if they feel the p-value has converged. There can be valid reasons for peeking at test results that trump maintaining statistical soundness; one example is checking for glitches in the testing framework. The problem with peeking, however, is that it corrupts the test: the more often you peek, the higher the effective probability of declaring a difference in CTRs when there is none.
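
To see the corruption concretely, here is a small, self-contained simulation sketch. The 0.2 CTR, the 250-record peeking interval and the plain two-proportion z-test are my own illustrative choices (not the testing framework used later in this post); with identical banners, the fraction of runs that ever reach p < 0.05 ends up far above the nominal 5%:

import numpy as np
from scipy.stats import norm

def two_proportion_p_value(x_a, n_a, x_b, n_b):
    # two-sided z-test for the difference of two proportions
    p_pool = (x_a + x_b) / float(n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1.0 / n_a + 1.0 / n_b))
    if se == 0:
        return 1.0
    z = (x_a / float(n_a) - x_b / float(n_b)) / se
    return 2 * (1 - norm.cdf(abs(z)))

def peeking_false_positive_rate(ctr=0.2, peek_every=250, max_n=20000, alpha=0.05, runs=500):
    false_positives = 0
    for _ in range(runs):
        x_a = x_b = n = 0
        while n < max_n:
            # both banners have the same true CTR
            x_a += np.random.binomial(peek_every, ctr)
            x_b += np.random.binomial(peek_every, ctr)
            n += peek_every
            if two_proportion_p_value(x_a, n, x_b, n) < alpha:
                false_positives += 1
                break
    return false_positives / float(runs)

print("False positive rate with peeking: %0.3f" % peeking_false_positive_rate())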

To summarize, we need a testing procedure that:

  1. takes advantage of the fact that the amount of data we need to determine whether the new banner is better than the control decreases as the difference increases, while

  2. controlling the probability of declaring the worse banner to be the winner.

What follows is a proposed method to meet these ends. It relies heavily on simulation and numerical Bayesian statistics.

As a warm-up, consider the following algorithm (ALG_1): evaluate a Null Hypothesis test on your samples every n records until you reach significance or hit the predetermined maximum sample size. If you stopped because you hit significance, declare the banner with the higher CTR at termination to be the winner. If you stopped because you hit the maximum sample size, declare the winner to be unknown.

I stated above that this kind of p-value peeking corrupts the classic null hypothesis test. When the peeking is ad hoc it is very hard to measure its effect, but here the peeking is done in a very controlled way and we can use simulation to answer questions like:

How does the probability of picking banner A change as a function of the percent lift in CTR that A gives over B?

We can simulate impressions of banner B by sampling from a Bernoulli distribution $Bernoulli(CTR_B)$. We can emulate impressions from a synthetic banner A with y percent lift over banner B by sampling from $Bernoulli(CTR_B \cdot (100+y)/100)$. For a given percent lift y, we can run the procedure described above k times to get an estimate of the probability of declaring A the winner and of the distribution over the number of samples needed to terminate.
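
As a minimal sketch of this sampling scheme (the 0.2 CTR and the 10% lift are hypothetical; the actual simulations below use the SimStream helper instead):

import numpy as np

ctr_b = 0.2
y = 10  # percent lift of the synthetic banner A over banner B

# each impression is a Bernoulli draw: 1 if the banner was clicked, 0 otherwise
b_impressions = np.random.binomial(1, ctr_b, size=1000)
a_impressions = np.random.binomial(1, ctr_b * (100 + y) / 100.0, size=1000)

print("Empirical CTR_B: %0.3f" % b_impressions.mean())
print("Empirical CTR_A: %0.3f" % a_impressions.mean())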

In practice, we would not know CTR_B, but we can gather some samples, get an estimate CTR_BHAT, and then sample from $Bernoulli(CTR_BHAT)$. Even better would be to form a confidence interval and use the lower bound, which I will explain later.

In pseudocode, ALG_1 looks roughly like this (compute_sample_size, get_samples, p_value and banner_with_higher_ctr are placeholders):

max_sample_size = compute_sample_size(control_CTR, mde, alpha, power)
samples = []
while len(samples) < max_sample_size:
    samples.extend(get_samples(n))   # collect the next n records
    if p_value(samples) < alpha:
        break
winner = banner_with_higher_ctr(samples) if p_value(samples) < alpha else 'unknown'

To motivate the design choices made, let's consider the following algorithm:

  1. Run the Control Banner for a while and form a 1-alpha confidence interval on its CTR

Most of the time, the control is called the control because you have run it before and have historical data on how it performed. In practice, the statistics underlying most behaviour on the web are not stationary, so take care that the data used for your estimate are reflective of the conditions under which you will actually test the banners.

  2. Create synthetic treatment banners (data drawn from $Bernoulli(CTR_B \cdot (100+y)/100)$ for a range of lifts y) and simulate running the test against the control to see how often each banner is declared the winner and how long the test takes.

In [6]:
%matplotlib inline
%load_ext autoreload
%autoreload 2
import pandas as pd
pd.options.display.mpl_style = 'default'
import numpy as np
from abtest_util import SimStream, DonationProb, EmpiricalDonationProb
from bayesian_abtest import CredibilityABTest, CostABTest
from nh_abtest import NHABTest, samples_per_branch_calculator
from abstract_abtest import expected_results, expected_results_by_lift



In [9]:
#Ground Truth Click Through Rates
p_B = DonationProb(0.20, [1,], [1.0,])
p_A = DonationProb(0.23, [1,], [1.0,])

# Estimate Control CTR From Historical Data
n = 5000
hist_data_B = SimStream(p_B).get_next_records(n)
p_hat = EmpiricalDonationProb(hist_data_B, p_B.values)
ci =  p_hat.p_donate_ci()
print "CI over control:",  ci
interval = round(1.0/ci[0])*50
print "Evalaution Interval", interval

# lift values of interest:
lifts = [-0.20, -0.10, -0.05, -0.025, -0.015,0, 0.015, 0.025, 0.05, 0.10, 0.20]


CI over control: (0.18791449961904458, 0.1986, 0.20983639012346994)
Evaluation Interval 250.0

In [35]:
# Set Up NH  AB Test
mde = 0.05
alpha = 0.05
power = 0.95
max_run = samples_per_branch_calculator(p_hat, mde, alpha, power)
print max_run

iters = 1000
expected_results_by_lift(NHABTest,[SimStream(p_A), SimStream(p_B), 250, max_run, alpha], iters, p_hat, lifts)


41950.4509031
-0.2
-0.1
-0.05
-0.025
-0.015
0
0.015
0.025
0.05
0.1
0.2
[-20.  -10.   -5.   -2.5  -1.5   0.    1.5   2.5   5.   10.   20. ]
Out[35]:
    % lift A over B   P(Choosing A) Median   P(Unknown) Median   Avg Time
 0            -20.0                  0.000               0.000     924.75
 1            -10.0                  0.009               0.000    2909.75
 2             -5.0                  0.030               0.019    9485.75
 3             -2.5                  0.085               0.266   20150.25
 4             -1.5                  0.118               0.457   24625.50
 5              0.0                  0.192               0.583   27605.00
 6              1.5                  0.433               0.478   25842.25
 7              2.5                  0.643               0.284   20845.00
 8              5.0                  0.929               0.022    9930.25
 9             10.0                  0.992               0.000    3312.00
10             20.0                  0.999               0.000    1005.00

As an example, let's set the true CTR of the control banner B to be 0.2. In practice, we will not know the true value of B's CTR; we will need to estimate it either from historical data or by running it by itself for a while. One concern might be that the simulated results are quite sensitive to errors in estimation. To account for this, we can form a confidence interval around CTR_B, use the lower and upper bounds of the interval to anchor the simulation, and see how the results differ. If the difference is too large, we will need to get more data on banner B to form a tighter interval. For the following experiments I simulated 5000 impressions of a banner with CTR = 0.2 and formed a 95% credible interval:


In [19]:
import numpy as np
from numpy.random import beta, binomial

impressions = 5000
clicks = binomial(impressions, 0.2)
CTR_B_distribution = beta(clicks, impressions-clicks, 10000)

print 'Lower Bound %0.3f' % np.percentile(CTR_B_distribution, 2.5)
print 'Upper Bound %0.3f' % np.percentile(CTR_B_distribution, 97.5)


Lower Bound 0.192
Upper Bound 0.214

Let's set MDE = 0.05, alpha = 0.05, power = 0.95 and do a Null Hypothesis test every n = 250 records. For the figures below, I ran the simulation k = 1000 times.

The first thing to notice is that the probability of declaring A to be better than B when the lift is 0 is 0.18. This means that the effective significance level of our procedure is roughly 0.36, even though the individual null hypothesis tests were run at significance 0.05. Here you have the corrupting power of peeking at p-values in action! But this algorithm has its redeeming qualities. Look at the expected number of samples needed as a function of the percent lift in CTR. We can detect a 10% lift with probability 0.988 with an average of only about 3500 records, one tenth of the classically determined sample size!

To get a deeper understanding of the effect of peeking, let's try the two extremes: testing only once and testing at every new pair of records:


In [37]:
# Set Up NH  AB Test
mde = 0.05
alpha = 0.05
power = 0.95
max_run = samples_per_branch_calculator(p_hat, mde, alpha, power)
print max_run

iters = 5000
expected_results_by_lift(NHABTest,[SimStream(p_A), SimStream(p_B), 41951, max_run, alpha], iters, p_hat, lifts)


41950.4509031
-0.2
-0.1
-0.05
-0.025
-0.015
0
0.015
0.025
0.05
0.1
0.2
[-20.  -10.   -5.   -2.5  -1.5   0.    1.5   2.5   5.   10.   20. ]
Out[37]:
    % lift A over B   P(Choosing A) Median   P(Unknown) Median   Avg Time
 0            -20.0                 0.0000              0.0000      41951
 1            -10.0                 0.0000              0.0000      41951
 2             -5.0                 0.0000              0.0476      41951
 3             -2.5                 0.0000              0.5536      41951
 4             -1.5                 0.0016              0.8042      41951
 5              0.0                 0.0272              0.9494      41951
 6              1.5                 0.1878              0.8106      41951
 7              2.5                 0.4264              0.5734      41951
 8              5.0                 0.9464              0.0536      41951
 9             10.0                 1.0000              0.0000      41951
10             20.0                 1.0000              0.0000      41951

Here we tested only once, after the classically computed number of records. We see that when the lift is 0, there is roughly a 5% chance of declaring a winner. This false discovery rate corresponds to the significance level of the Null Hypothesis test, which we set to 5%. We also specified that we want to be able to detect a 5% effect 95% of the time. In the simulation, we detected a 5% gain over the control CTR 94.5% of the time and a 5% drop 94.9% of the time. The results of the simulation match the setup of the Null Hypothesis test, as expected. Now, when we go to the extreme of testing after every pair of records, we make false discoveries 65.2% of the time. We declare a banner with a 5% lift over the control the winner only 81.5% of the time, and a banner with a -5% lift the winner 16.7% of the time. The advantage of testing every record is that the expected run time of the test is minimized. The takeaway is that you can trade off accuracy against sample size by changing the interval at which you test.


In [16]:
# Set Up NH  AB Test
mde = 0.05
alpha = 0.05
power = 0.95
max_run = samples_per_branch_calculator(p_hat, mde, alpha, power)
print max_run

iters = 1000
expected_results_by_lift(NHABTest,[SimStream(p_A), SimStream(p_B), 1, max_run, alpha], iters, p_hat, lifts)


41069.3920898
-0.2
-0.1
-0.05
-0.025
-0.015
0
0.015
0.025
0.05
0.1
0.2
Out[16]:
    lift A over B (fraction)   P(Choosing A)   P(Unknown)    Avg Time
 0                    -0.200           0.070        0.000     518.563
 1                    -0.100           0.115        0.000    1807.949
 2                    -0.050           0.169        0.016    5645.214
 3                    -0.025           0.225        0.174   12264.695
 4                    -0.015           0.240        0.251   13660.303
 5                     0.000           0.326        0.329   15751.379
 6                     0.015           0.486        0.279   15103.576
 7                     0.025           0.593        0.196   12534.981
 8                     0.050           0.815        0.011    5628.926
 9                     0.100           0.860        0.000    1949.256
10                     0.200           0.934        0.000     607.010

The algorithm I will describe next has the same form, but we will swap out the Null Hypothesis test for a different stopping criterion. Instead of stopping when the probability of the observed difference in means, under the assumption that the banners have the same CTR, is below our significance level alpha (wow, so convoluted), let's stop when the expected cost associated with our decision is low. This idea comes from Chris Stucchio's recent post.

In particular, there is something strange going on when we run the simulation at 0 lift: we are waiting around for the null hypothesis test to falsely declare significance. Wouldn't it be better to terminate the experiment when we are reasonably sure that the banners are the same and that the expected cost of choosing one over the other is small?

To achieve this objective, we will need to switch to the Bayesian formalism, in which we treat our click-through rates not as unknown fixed parameters but as random variables. For a primer on Bayesian A/B testing, I recommend Sergey Feldman's blog post. Using the techniques described in Feldman's blog, we can form a numeric representation of the joint posterior distribution over the click-through rates CTR_A and CTR_B, which will be crucial.

Say that after collecting N samples, you find banner A has a higher empirical click through rate than banner B.

To decide whether to stop the test, we could check how much of a chance remains that B is better than A, and stop if P(CTR_B > CTR_A) is below a threshold. The rub is that if the banners really were the same, then P(CTR_B > CTR_A) would hover around 0.5. We would be in the same situation as before, waiting for a false result to terminate our test.
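
For concreteness, here is a minimal sketch of that stopping quantity, with hypothetical click counts and a uniform Beta(1, 1) prior on each banner's CTR (the tests later in this post use the estimator classes imported above instead):

import numpy as np
from numpy.random import beta

# hypothetical counts observed so far
a_clicks, a_impressions = 210, 1000
b_clicks, b_impressions = 190, 1000

# draw from the Beta posteriors (uniform Beta(1, 1) prior assumed)
a_dist = beta(a_clicks + 1, a_impressions - a_clicks + 1, 10000)
b_dist = beta(b_clicks + 1, b_impressions - b_clicks + 1, 10000)

# the chance that B is actually better than A; stop if this falls below a threshold
print("P(CTR_B > CTR_A) = %0.3f" % (b_dist > a_dist).mean())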

To avoid this, consider the following cost function:

$$
f(CTR_A, CTR_B) =
\begin{cases}
0 & \text{if } CTR_A > CTR_B \\
CTR_B - CTR_A & \text{if } CTR_B > CTR_A
\end{cases}
$$


We incur 0 cost if we pick the better banner and incur the difference in CTRs if we make the wrong decision. We can run the test until the expected cost $E[f(CTR_A, CTR_B)]$ drops below a threshold.

The code snippet below shows how to compute the expected cost through numerical integration, given the number of clicks and impressions for both banners:


In [14]:
import numpy as np
from numpy.random import beta

def expected_cost(a_clicks, a_impressions, b_clicks, b_impressions, num_samples=10000):
    a_dist = beta(a_clicks, a_impressions - a_clicks, num_samples) 
    b_dist = beta(b_clicks, b_impressions - b_clicks, num_samples) 
    return np.maximum((b_dist-a_dist), 0).mean()

In [15]:
expected_cost(10, 20, 12, 20)


Out[15]:
0.1248811195960521

Chris Stucchio recently derived a closed form solution for this cost function. This is certainly more computationally efficient, but the advantage of the numeric integration is that it is easy to experiment with slight modifications to the cost function. For example, you could experiment with incurring the relative cost $(CTR_B - CTR_A)/CTR_A$ instead of the absolute cost $CTR_B - CTR_A$, as in the sketch below.
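
As a sketch of that relative-cost variant (the same numerical integration as the expected_cost function above, with only the cost term changed):

import numpy as np
from numpy.random import beta

def expected_relative_cost(a_clicks, a_impressions, b_clicks, b_impressions, num_samples=10000):
    # identical to expected_cost above, except the loss from wrongly picking A
    # is measured relative to A's CTR rather than as an absolute difference
    a_dist = beta(a_clicks, a_impressions - a_clicks, num_samples)
    b_dist = beta(b_clicks, b_impressions - b_clicks, num_samples)
    return np.maximum((b_dist - a_dist) / a_dist, 0).mean()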

Below are the results for computing the expected cost every 250 records and stopping if the expected cost drops below 0.0002. Picking the right threshold is a bit difficult. I tried to find a value where the probability of choosing banner A as a function of lift roughly matches up with the previous simulation, where we used the Null Hypothesis test as the stopping criterion.


In [40]:
# Set Up Cost  AB Test

cost = 0.0002
max_run = float('inf')  # this one is so clean it doesn't need a max_run cap

iters = 1000
expected_results_by_lift(CostABTest,[SimStream(p_A), SimStream(p_B), 250, max_run, cost], iters, p_hat, lifts)


-0.2
-0.1
-0.05
-0.025
-0.015
0
0.015
0.025
0.05
0.1
0.2
[-20.  -10.   -5.   -2.5  -1.5   0.    1.5   2.5   5.   10.   20. ]
Out[40]:
    % lift A over B   P(Choosing A) Median   P(Unknown) Median   Avg Time
 0            -20.0                  0.000                   0     880.75
 1            -10.0                  0.007                   0    2366.75
 2             -5.0                  0.025                   0    6009.75
 3             -2.5                  0.121                   0   13803.25
 4             -1.5                  0.172                   0   20308.00
 5              0.0                  0.471                   0   32284.50
 6              1.5                  0.803                   0   20309.25
 7              2.5                  0.897                   0   13176.75
 8              5.0                  0.959                   0    6376.50
 9             10.0                  0.991                   0    2602.50
10             20.0                  1.000                   0    1063.75

In [39]:
# Set Up Cost  AB Test

cost = 0.0002
max_run = 41951  # cap this run at the classically computed sample size

iters = 1000
expected_results_by_lift(CostABTest,[SimStream(p_A), SimStream(p_B), 250, max_run, cost], iters, p_hat, lifts)


-0.2
-0.1
-0.05
-0.025
-0.015
0
0.015
0.025
0.05
0.1
0.2
[-20.  -10.   -5.   -2.5  -1.5   0.    1.5   2.5   5.   10.   20. ]
Out[39]:
    % lift A over B   P(Choosing A) Median   P(Unknown) Median   Avg Time
 0            -20.0                  0.000               0.000     910.75
 1            -10.0                  0.005               0.000    2390.25
 2             -5.0                  0.041               0.000    6098.00
 3             -2.5                  0.110               0.065   12282.50
 4             -1.5                  0.159               0.137   14828.25
 5              0.0                  0.408               0.220   17409.25
 6              1.5                  0.683               0.125   14700.00
 7              2.5                  0.814               0.072   12815.50
 8              5.0                  0.964               0.001    6305.75
 9             10.0                  0.993               0.000    2521.50
10             20.0                  1.000               0.000    1017.00

In the examples above I assumed we knew the CTR of the control banner B and that it was 0.2. This allowed us to run synthetic tests comparing banner B with a synthetic banner A that has some lift over B. The point of the simulation was to see how large an effect we are likely to discover and how long it might take.

In practice, we will not know the true value of B's CTR. We will need to estimate it either from historical data or by running it by itself for a while. One concern might be that the simulated results are quite sensitive to errors in estimation. To account for this, we can form a confidence interval around CTR_B, use the lower and upper bounds of the interval as our CTR for B, and see how the results differ. If the difference is too large, we will need to get more data on banner B to form a tighter interval.


In [3]:
# Set Up Credible  AB Test

conf = 0.95

#in this case we need to estimate the max run time
mde = 0.05 
iters = 100
p_better, p_same, times = expected_results(CredibilityABTest,[SimStream(p_B.lift(mde)), SimStream(p_B), interval, float('inf'), conf], iters)
print "P(Choose A):",  p_better
print "Mean Run time: ", np.mean(times)
print "STD run Time: ", np.std(times)
max_run = np.percentile(times, 95)
print "Max Run Time: ", max_run

iters = 200
expected_results_by_lift(CredibilityABTest,[SimStream(p_A), SimStream(p_B), interval, max_run, conf], iters, p_hat, lifts)


P(Choose A): 0.94
Mean Run time:  6512.5
STD run Time:  7140.58602287
Max Run Time:  22550.0

In [206]:
from numpy.random import binomial, multinomial, beta, dirichlet
import matplotlib.pyplot as plt
a_dist =  beta(7500,  378250, 1000000)
b_dist =  beta(7500,  378250, 1000000)

#print (a_dist - b_dist)/b_dist
#pd.Series((a_dist - b_dist)/b_dist).apply(lambda x: max(x, 0) )

In [207]:
lift_d = (a_dist - b_dist)/b_dist
plt.hist((a_dist - b_dist)/b_dist, bins = 100)
cost_d = pd.Series((a_dist - b_dist)/b_dist).apply(lambda x: max(x, 0) )
plt.hist(cost_d, bins = 100)
print "Cost of Choosing B:", cost_d.mean()
print "A_mean ", a_dist.mean()
print "B_mean:", b_dist.mean()
print "P(B_Better):", (b_dist>a_dist).mean()


Cost of Choosing B: 0.00651930530548
A_mean  0.0194428315244
B_mean: 0.0194426645613
P(B_Better): 0.499427